[restatectl] Improve performance and information display with unprovisioned clusters or under partial node connectivity #2748

pcholakov · 2025-02-17T09:36:00Z

This PR aims to improve the overall responsiveness of restatectl with clusters that are still busy provisioning or otherwise only partially connected to the host running the tool.

Iterating over the nodes to fetch the nodes configuration or metadata now remembers which nodes were unresponsive
Metadata status now polls the known metadata-server nodes concurrently, skipping nodes flagged as unreachable by earlier connection attempts
The overall CLI connect timeout is reduced to 3s (from 5s)
restatectl status now prints a heading for the Metadata section to visually separate it from the other sections
Debug logging added in several places
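
The first two points can be illustrated with a minimal standard-library sketch. It is not the code from this PR: `NodeId`, `UnreachableNodes`, `poll_all` and the `probe` callback are hypothetical names, and plain threads stand in for whatever async runtime the CLI actually uses. The idea is only that nodes already flagged as unreachable are skipped, the remaining nodes are probed concurrently, and new failures are recorded for later calls.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical node id type; restatectl's real types differ.
type NodeId = u32;

/// Shared set of nodes that failed to respond, with the failure reason.
#[derive(Clone, Default)]
struct UnreachableNodes(Arc<Mutex<HashMap<NodeId, String>>>);

impl UnreachableNodes {
    fn mark(&self, node: NodeId, reason: String) {
        self.0.lock().unwrap().insert(node, reason);
    }

    fn contains(&self, node: NodeId) -> bool {
        self.0.lock().unwrap().contains_key(&node)
    }
}

/// Probes all nodes concurrently, skipping nodes already flagged as unreachable.
/// Failures are recorded in `flagged`; successes are returned sorted by node id.
fn poll_all<T, F>(nodes: &[NodeId], flagged: &UnreachableNodes, probe: F) -> Vec<(NodeId, T)>
where
    T: Send,
    F: Fn(NodeId) -> Result<T, String> + Sync,
{
    let probe = &probe;
    let mut responses = Vec::new();
    thread::scope(|s| {
        // Spawn one probe per node that is not already known to be unreachable.
        let handles: Vec<_> = nodes
            .iter()
            .copied()
            .filter(|n| !flagged.contains(*n))
            .map(|n| (n, s.spawn(move || probe(n))))
            .collect();
        // Collect results; remember every node that failed.
        for (n, handle) in handles {
            match handle.join().expect("probe thread panicked") {
                Ok(status) => responses.push((n, status)),
                Err(reason) => flagged.mark(n, reason),
            }
        }
    });
    responses.sort_by_key(|(n, _)| *n);
    responses
}
```

Because the flagged set is shared, a later call (for example, the one that renders the metadata table) does not pay the connect timeout again for a node that an earlier call already found dead.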

Further work:

Update get_latest_metadata to gather IdentResponses concurrently and cache them (so they can be reused between the nodes and logs lookups)
Be smarter about which nodes to contact based on the returned status
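
As a rough sketch of the caching idea in the first item, assuming a per-invocation cache keyed by node address: `Ident` below is a simplified stand-in for the real IdentResponse (its fields are illustrative only), and `IdentCache`, `get_or_fetch` and `freshest_by` are hypothetical names that do not exist in the codebase.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Simplified stand-in for the real IdentResponse; fields are illustrative only.
#[derive(Clone, Debug, PartialEq)]
struct Ident {
    nodes_config_version: u32,
    logs_version: u32,
}

/// Caches one response per node address for the lifetime of a CLI invocation.
#[derive(Default)]
struct IdentCache {
    entries: Mutex<HashMap<String, Ident>>,
}

impl IdentCache {
    /// Returns the cached response for `addr`, calling `fetch` only on a miss.
    /// Failed fetches are not cached.
    fn get_or_fetch<F>(&self, addr: &str, fetch: F) -> Result<Ident, String>
    where
        F: FnOnce(&str) -> Result<Ident, String>,
    {
        // The lock guard is a temporary, so it is released at the end of this statement.
        let cached = self.entries.lock().unwrap().get(addr).cloned();
        if let Some(hit) = cached {
            return Ok(hit);
        }
        // No lock is held during the (potentially slow) network call.
        let ident = fetch(addr)?;
        self.entries
            .lock()
            .unwrap()
            .insert(addr.to_string(), ident.clone());
        Ok(ident)
    }

    /// Picks the cached node with the highest value of `key`,
    /// e.g. the newest nodes configuration or the newest logs version.
    fn freshest_by<K: Ord>(&self, key: impl Fn(&Ident) -> K) -> Option<(String, Ident)> {
        let entries = self.entries.lock().unwrap();
        let best = entries
            .iter()
            .max_by_key(|(_, ident)| key(*ident))
            .map(|(addr, ident)| (addr.clone(), ident.clone()));
        best
    }
}
```

With a cache like this, the nodes lookup and the logs lookup could each call `freshest_by` with a different key after a single round of requests, instead of contacting every node twice.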

Testing

Single alive node of a three-node cluster (not yet provisioned):

❯ rc st -s node1.cluster.orb.local:5122
Node Configuration (v3)
 NODE  GEN  NAME   ADDRESS                               ROLES                                         
 N1    2    node1  http://node1.cluster.orb.local:5122/  admin | log-server | metadata-server | worker 

Logs v1
└ Logs Provider: replicated
 ├ Log replication: {node: 2}
 └ Nodeset size: 0
No logs found. Has the cluster been provisioned yet?

Alive partition processors (nodes config v3, partition table v1)
 P-ID  NODE  MODE  STATUS  LEADER  EPOCH  SEQUENCER  APPLIED-LSN  PERSISTED-LSN  SKIPPED-RECORDS  ARCHIVED-LSN  LAST-UPDATE 

Metadata service nodes
 NODE  STATUS  VERSION  LEADER  MEMBERS  APPLIED  COMMITTED  TERM  LOG-LENGTH  SNAP-INDEX  SNAP-SIZE 
 N1    Member  v1       N1      [N1]     2        2          2     1           1           472 B

Single dead node:

❯ time rc st -s node2.cluster.orb.local:5122
Error: Encountered multiple errors:
 - http://node2.cluster.orb.local:5122/ -> status: Unavailable, message: "tcp connect error: deadline has elapsed", details: [], metadata: MetadataMap { headers: {} }

restatectl st -s node2.cluster.orb.local:5122  0.03s user 0.01s system 0% cpu 5.348 total

Provisioned cluster with two alive and one dead nodes:

❯ time rc st -s node1.cluster.orb.local:5122
Node Configuration (v9)
 NODE  GEN  NAME   ADDRESS                               ROLES                                         
 N1    2    node1  http://node1.cluster.orb.local:5122/  admin | log-server | metadata-server | worker 
 N2    1    node2  http://node2.cluster.orb.local:5122/  admin | log-server | metadata-server | worker 
 N3    1    node3  http://node3.cluster.orb.local:5122/  admin | log-server | metadata-server | worker 

Logs v3
└ Logs Provider: replicated
 ├ Log replication: {node: 2}
 └ Nodeset size: 0
 L-ID  FROM-LSN  KIND        LOGLET-ID  REPLICATION  SEQUENCER  NODESET      
 0     2         Replicated  0_1        {node: 2}    N2:1       [N1, N2, N3] 
 1     2         Replicated  1_1        {node: 2}    N2:1       [N1, N2, N3] 
 2     2         Replicated  2_1        {node: 2}    N1:2       [N1, N2, N3] 
 3     2         Replicated  3_1        {node: 2}    N1:2       [N1, N2, N3] 
...

Alive partition processors (nodes config v9, partition table v4)
 P-ID  NODE  MODE      STATUS  LEADER  EPOCH  SEQUENCER  APPLIED-LSN  PERSISTED-LSN  SKIPPED-RECORDS  ARCHIVED-LSN  LAST-UPDATE             
 0     N1:2  Follower  Active  N2:1    e1                1            -              0                -             582 ms ago              
 0     N2:1  Leader    Active  N2:1    e1                1            -              0                -             802 ms ago              
 1     N1:2  Follower  Active  N2:1    e1                1            -              0                -             1 second and 92 ms ago  
 1     N2:1  Leader    Active  N2:1    e1                1            -              0                -             802 ms ago              
 2     N1:2  Leader    Active  N1:2    e1                1            -              0                -             517 ms ago              
 2     N2:1  Follower  Active  N1:2    e1                1            -              0                -             608 ms ago              
 3     N1:2  Leader    Active  N1:2    e1                1            -              0                -             971 ms ago              
 3     N2:1  Follower  Active  N1:2    e1                1            -              0                -             923 ms ago              
...

☠️ Dead nodes
 NODE  LAST-SEEN                           
 N3    1 minute, 33 seconds and 503 ms ago 

Metadata service nodes
 NODE  STATUS  VERSION  LEADER  MEMBERS     APPLIED  COMMITTED  TERM  LOG-LENGTH  SNAP-INDEX  SNAP-SIZE 
 N2    Member  v3       N1      [N1,N2,N3]  41       41         2     4           37          8.8 kiB   
 N1    Member  v3       N1      [N1,N2,N3]  41       41         2     4           37          8.8 kiB   

🔌 Unreachable nodes
 NODE  REASON                                                                                              
 N3    status: Unknown, message: "Node is unreachable", details: [], metadata: MetadataMap { headers: {} } 
restatectl st -s node1.cluster.orb.local:5122  0.06s user 0.02s system 1% cpu 5.405 total


github-actions · 2025-02-17T09:56:16Z

Test Results

7 files ±0 7 suites ±0 4m 15s ⏱️ -1s
47 tests ±0 46 ✅ ±0 1 💤 ±0 0 ❌ ±0
182 runs ±0 179 ✅ ±0 3 💤 ±0 0 ❌ ±0

Results for commit f3208bf. ± Comparison against base commit a470649.

♻️ This comment has been updated with latest results.


muhamadazmy

Thank you so much @pcholakov for these really nice improvements. I also like that the metadata status is now fetched in parallel. It made me think that there are probably more things that can be done in parallel, including the fetching in ConnectionInfo::get_latest_metadata. What do you think?

pcholakov · 2025-02-17T13:45:56Z

Yeah, absolutely! I'd definitely love to continue this by fetching metadata in parallel. I can imagine a scatter-gather version of ConnectionInfo::try_each, for example, which reaches out to the desired number of nodes concurrently and adds more tasks as needed when some responses from the initial batch fail. Lots of room for improvement :-) I also started down the path of creating a standalone ConnectionCache struct, but it turned out that I needed to lock the connection cache and the dead node set separately. Definitely some room for further evolution there, too!
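
To make the scatter-gather idea concrete, here is a minimal sketch using only the standard library. It is not ConnectionInfo::try_each and not a proposal for its signature: `scatter_gather`, its parameters and the string error type are all assumptions made for illustration. It keeps `want` requests outstanding and launches one replacement for each failure until enough nodes have answered or no candidates remain.

```rust
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;

/// Contacts nodes concurrently until `want` of them succeed or candidates run out.
/// Returns the successful responses and the errors seen, both in completion order.
fn scatter_gather<N, T, F>(nodes: Vec<N>, want: usize, op: F) -> (Vec<T>, Vec<String>)
where
    N: Send,
    T: Send,
    F: Fn(N) -> Result<T, String> + Sync,
{
    let op = &op;
    let (mut ok, mut errors) = (Vec::new(), Vec::new());
    thread::scope(|s| {
        let (tx, rx) = mpsc::channel();
        let mut pending = nodes.into_iter();
        let mut in_flight = 0;
        loop {
            // Top up so that successes plus in-flight requests equals `want`.
            while ok.len() + in_flight < want {
                let Some(node) = pending.next() else { break };
                let tx = tx.clone();
                s.spawn(move || {
                    // Turn a panic into an error so the collector never waits forever.
                    let result = panic::catch_unwind(AssertUnwindSafe(|| op(node)))
                        .unwrap_or_else(|_| Err("task panicked".to_string()));
                    let _ = tx.send(result);
                });
                in_flight += 1;
            }
            if in_flight == 0 {
                break; // Either enough successes or no candidates left.
            }
            match rx.recv().expect("a sender is still held in this scope") {
                Ok(value) => ok.push(value),
                Err(reason) => errors.push(reason),
            }
            in_flight -= 1;
        }
    });
    (ok, errors)
}
```

One consequence of this design is that healthy clusters never contact more than `want` nodes, while a cluster with dead nodes pays at most one extra connect timeout per failed replacement rather than one per node in sequence.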

pcholakov added 2 commits February 17, 2025 09:13

Fix typo in reconfigure

c352f0e

Update headings for logs and metadata list

cdc4026

pcholakov changed the title ~~Feat/restatectl node connections~~ [restatectl] Improve performance and information display with unprovisioned clusters or under partial node connectivity Feb 17, 2025

pcholakov requested a review from muhamadazmy February 17, 2025 10:14

pcholakov added 2 commits February 17, 2025 13:56

Remember unresponsive nodes, refresh metadata service status concurrently

f1149a9

Lower the default CLI connect timeout to 3s

0fc6cb9

pcholakov force-pushed the feat/restatectl-node-connections branch from 484c27f to 0fc6cb9 Compare February 17, 2025 11:56

Sort metadata servers by node id

f3208bf

pcholakov force-pushed the feat/restatectl-node-connections branch from 97368a9 to f3208bf Compare February 17, 2025 12:38

pcholakov marked this pull request as ready for review February 17, 2025 12:38

muhamadazmy approved these changes Feb 17, 2025

pcholakov mentioned this pull request Feb 17, 2025

Compact cluster status #2714

Merged

pcholakov merged commit fbc82b5 into main Feb 17, 2025
29 checks passed

pcholakov deleted the feat/restatectl-node-connections branch February 17, 2025 15:28

pcholakov mentioned this pull request Feb 17, 2025

Improve restatectl to display nodes that are provisioning #2453

Closed
